1. Introduction


This report presents an algorithm for the solution of multiple kernel learning (MKL)
problems with elastic-net constraints on the kernel weights. Please see Sun et al. (2013) and
Yang et al. (2011) for a review on multiple kernel learning and its extensions. In particular,
Yang et al. (2011) introduced the generalized multiple kernel learning (GMKL) model,
where the kernel weights are subject to elastic-net constraints.


While Xu et al. (2010) present an elegant algorithm to solve MKL problems with
$L_1$-norm and $L_p$-norm ($p > 1$) constraints, a similar algorithm is lacking in the case of
MKL under elastic-net constraints. For example, algorithms based on the cutting plane
method (Yang et al., 2011) require large and/or commercial libraries (e.g., MOSEK).


The algorithm presented in this report provides an extremely simple and efficient solution
to the elastic-net constrained MKL (GMKL) problem. Because it can be implemented
in a few lines of code and does not depend on external libraries (except a conventional
$L_2$-norm SVM solver), it has a wider applicability and can be readily included in existing
open-source machine learning libraries.


1.1 Notation 

The symbol $\mathbb{R}_+^Q$ denotes the set of $Q$-dimensional vectors of nonnegative real numbers, while
$\mathbb{R}_{++}^Q$ denotes the set of vectors of strictly positive real numbers. The curled inequality
symbols (e.g., $\succ$) represent componentwise inequality. The symbol $1_Q$ ($0_Q$) denotes a $Q \times 1$
vector of all ones (zeros) while $e_k$ is the vector with all entries zero except the $k$-th, which is one.
The expression $a \circ b$ computes the componentwise product of the vectors $a$ and $b$. The notation
$\theta_k$ refers to the $k$-th component of the vector $\theta$ while $\theta^{(m)}$ indicates the value of the vector
$\theta$ at the $m$-th iteration of an iterative algorithm. For simplicity of notation, all summations
involving $\ell$ go from 1 to $N$ (the number of training instances) while those involving $j$ or $k$
go from 1 to $Q$ (the number of kernels).
2. Elastic-net constrained MKL problem

2.1 Formulation of the generalized MKL problem

Given a set of labelled training data $\mathcal{D} = \{x_\ell, y_\ell\}_{\ell=1}^{N}$, where $x_\ell \in \mathcal{X}$ and
$y_\ell \in \{-1, +1\}$, the learning problem corresponding to a generalized MKL classifier with
elastic-net constraints (Yang et al., 2011) can be formulated as

$$
\underset{\theta \in \Theta,\; b \in \mathbb{R},\; \{f_k \in \mathcal{H}_k\}}{\text{minimize}} \quad
\frac{1}{2} \sum_k \frac{\|f_k\|_{\mathcal{H}_k}^2}{\theta_k}
+ C \sum_\ell L\Big(\sum_k f_k(x_\ell) + b,\; y_\ell\Big), \tag{1}
$$

where $\mathcal{H}_k$ is the reproducing kernel Hilbert space (RKHS) associated with the $k$-th kernel,
$C$ is the usual SVM regularization constant, $L(\cdot)$ is the hinge loss function, and

$$
\Theta = \{\theta \in \mathbb{R}_+^Q : \eta \|\theta\|_1 + (1-\eta) \|\theta\|_2^2 \leq 1\} \tag{2}
$$

represents the elastic-net constraint on the kernel weights, with parameter $\eta \in [0,1]$. When
$\theta_k = 0$, $f_k$ must also be equal to zero (Rakotomamonjy et al., 2007) and the problem remains
well-defined (under the convention $0/0 = 0$). Note that the minimization problem in (1) is
a convex optimization problem because: a) the function to be minimized is jointly convex
in its parameters $\theta$, $\{f_k\}$, and $b$ (Rakotomamonjy et al., 2007); and b) the search space
is convex, in particular the elastic-net constraint set $\Theta$.


2.2 Two-step block coordinate descent algorithm 

The approach taken in this manuscript for the solution of (1) consists of a two-step block
coordinate descent alternating between the optimization of the SVM classifiers and the
optimization of the kernel weights. The procedure, which is reported in Algorithm 1, iterates
until a stopping condition is met (see Section 2.3).

At iteration $m$, the first step minimizes problem (1) with respect to $\{f_k\}$ and $b$ for fixed
values of the kernel weights $\theta^{(m)}$. As previously noted by others (Rakotomamonjy et al.,
2007; Xu et al., 2010; Yang et al., 2011), this problem is equivalent to the standard SVM
problem with a composite kernel $K(\cdot,\cdot) = \sum_k \theta_k^{(m)} K_k(\cdot,\cdot)$. Given the stack of Gram
matrices $G_{k\ell\ell'}$, where $G_{k\ell\ell'} = K_k(x_\ell, x_{\ell'})$, existing SVM solvers can efficiently solve the
composite SVM problem with Gram matrix $\sum_k \theta_k^{(m)} G_k$ and return the optimal bias $b^{(m)}$
and vector of dual coefficients $\alpha^{(m)}$.

The second step consists in minimizing (1) for $\theta \in \Theta$ while keeping $b$ and $\{f_k\}$ (or
equivalently the dual coefficients) constant. Since the only term that depends on $\theta$ is the
regularizer, we can define

$$
\beta_k^{(m)} = \|f_k^{(m)}\|_{\mathcal{H}_k}^2
= (\theta_k^{(m)})^2 \big[(\alpha^{(m)} \circ y)^\top G_k\, (\alpha^{(m)} \circ y)\big], \quad \forall k, \tag{3}
$$

and attack this sub-problem as an instance of the more general problem of minimizing a
weighted sum of reciprocals bound to elastic-net constraints:

$$
\theta^{(m+1)} = \underset{\theta \in \Theta}{\arg\min} \sum_k \frac{\beta_k^{(m)}}{\theta_k}. \tag{4}
$$
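The quantities entering (3) only require quadratic forms of the Gram matrices. The following is a minimal sketch in plain Python, assuming each Gram matrix is stored as a nested list; the function names `quad_forms` and `wsr_weights` are illustrative and do not appear in the original report.

```python
def quad_forms(grams, alpha, y):
    """Return the quadratic forms (alpha o y)^T G_k (alpha o y), one per Gram matrix."""
    v = [a * t for a, t in zip(alpha, y)]          # componentwise product alpha o y
    n = len(v)
    return [sum(v[i] * G[i][j] * v[j] for i in range(n) for j in range(n))
            for G in grams]

def wsr_weights(theta, u):
    """Return beta_k = theta_k^2 * u_k, i.e. ||f_k||^2 as in Eq. (3)."""
    return [t * t * uk for t, uk in zip(theta, u)]

# Toy data (made up for illustration): two 2x2 Gram matrices, two training points.
grams = [[[1.0, 0.0], [0.0, 1.0]],
         [[1.0, 1.0], [1.0, 1.0]]]
u = quad_forms(grams, alpha=[1.0, 2.0], y=[1, -1])
beta = wsr_weights([0.5, 2.0], u)
```

The vector `beta` is what the weighted-sum-of-reciprocals sub-problem (4) takes as input.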
Algorithm 1: Solve elastic-net constrained MKL.

Function SolveElNetMKL($G_1, \ldots, G_Q$, $y$) is
    // Initialization
    $\theta \leftarrow 1_Q / s(1_Q)$;
    for $m \leftarrow 1$ to maximum number of iterations do
        // Step 1: optimization of the SVM classifiers
        build composite Gram matrix: $G \leftarrow \sum_k \theta_k G_k$;
        solve std SVM with $G$, $y$ to get optimal dual coeffs and bias: $\alpha$, $b$;
        // Step 1.5: check convergence
        for $k \leftarrow 1$ to $Q$ do $u_k \leftarrow [(\alpha \circ y)^\top G_k (\alpha \circ y)]$;
        compute objective function (1) from dual form of SVM: $O \leftarrow 1^\top \alpha - \frac{1}{2} u^\top \theta$;
        solve elastic-net constrained LP: $\bar\theta \leftarrow$ SolveElNetLP($u$);
        compute lower bound of (1): $\underline{O} \leftarrow 1^\top \alpha - \frac{1}{2} u^\top \bar\theta$;
        if $O / \underline{O} - 1 < \epsilon_{\mathrm{MKL}}$ then break;
        // Step 2: optimization of the kernel weights
        compute $\|f_k\|^2$: $\beta \leftarrow \theta \circ \theta \circ u$;
        solve elastic-net constr. weighted sum of recipr.: $\theta \leftarrow$ SolveElNetWSR($\beta$, $\theta$);
    end
    return $\theta$, $\alpha$, $b$;
end
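The outer loop of Algorithm 1 can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the author's implementation: the three solvers are passed in as callables (`solve_svm`, `solve_lp`, `solve_wsr` are hypothetical names), Gram matrices are nested lists, and the default tolerances are arbitrary.

```python
import math

def s_norm(x, eta):
    """Rescaling function s(x) of Eq. (10), used here only for the initial weights."""
    n1 = 0.5 * eta * sum(x)
    return n1 + math.sqrt(n1 * n1 + (1.0 - eta) * sum(v * v for v in x))

def solve_elnet_mkl(grams, y, eta, solve_svm, solve_lp, solve_wsr,
                    eps_mkl=1e-6, max_iter=500):
    """Two-step block coordinate descent following Algorithm 1.

    Assumed (hypothetical) callables supplied by the user:
      solve_svm(G, y)       -> (alpha, b) for the Gram matrix G
      solve_lp(u)           -> maximizer of u^T theta under the elastic-net constraint
      solve_wsr(beta, theta) -> minimizer of sum_k beta_k / theta_k, warm-started at theta
    """
    Q, n = len(grams), len(y)
    theta = [1.0 / s_norm([1.0] * Q, eta)] * Q            # theta <- 1_Q / s(1_Q)
    for _ in range(max_iter):
        # Step 1: SVM on the composite Gram matrix
        G = [[sum(t * Gk[i][j] for t, Gk in zip(theta, grams)) for j in range(n)]
             for i in range(n)]
        alpha, b = solve_svm(G, y)
        # Step 1.5: objective, lower bound and relative gap
        v = [a * t for a, t in zip(alpha, y)]
        u = [sum(v[i] * Gk[i][j] * v[j] for i in range(n) for j in range(n))
             for Gk in grams]
        objective = sum(alpha) - 0.5 * sum(uk * t for uk, t in zip(u, theta))
        theta_bar = solve_lp(u)
        lower = sum(alpha) - 0.5 * sum(uk * t for uk, t in zip(u, theta_bar))
        if objective / lower - 1.0 < eps_mkl:
            break
        # Step 2: kernel weights
        beta = [t * t * uk for t, uk in zip(theta, u)]
        theta = solve_wsr(beta, theta)
    return theta, alpha, b
```

Passing the solvers as arguments keeps the sketch independent of any particular SVM library; any solver that accepts a precomputed Gram matrix and returns the dual coefficients could be plugged in.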


Assuming positive definite kernels and excluding degenerate cases causing $\alpha^{(m)} = 0$ (e.g.,
all examples belonging to the same class), we have that $\beta^{(m)} \succ 0$ as long as $\theta^{(m)} \succ 0$.
Because (4) diverges to $+\infty$ when any $\theta_k$ approaches zero, the minimization of (4) will
always produce $\theta^{(m+1)} \succ 0$ as long as $\beta^{(m)} \succ 0$, i.e. ultimately provided that the initial
point $\theta^{(0)} \succ 0$.

In the special case $\eta = 1$, the elastic-net constraint reduces to a lasso constraint and the
problem (4) has a straightforward closed-form solution (Xu et al., 2010). In this manuscript,
a novel, simple and efficient algorithm for the solution of this optimization problem in
the general case $\eta \in [0,1]$ is presented. Since the proposed solution to this sub-problem
represents the novelty and main contribution of this paper, Section 3 will be entirely devoted
to explaining this algorithm in detail.
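For reference, when $\eta = 1$ problem (4) reduces to minimizing $\sum_k \beta_k/\theta_k$ subject to $\sum_k \theta_k \leq 1$, $\theta \succeq 0$, and a standard Lagrangian (or Cauchy-Schwarz) argument gives $\theta_k = \sqrt{\beta_k} / \sum_j \sqrt{\beta_j}$. The sketch below is derived here rather than quoted from Xu et al. (2010); the function names are illustrative.

```python
import math

def lasso_wsr(beta):
    """Closed-form minimizer of sum_k beta_k / theta_k subject to sum_k theta_k <= 1."""
    roots = [math.sqrt(b) for b in beta]
    total = sum(roots)
    return [r / total for r in roots]

def wsr_cost(beta, theta):
    """Weighted sum of reciprocals, the objective of problem (4)."""
    return sum(b / t for b, t in zip(beta, theta))

beta = [1.0, 4.0, 9.0]
theta = lasso_wsr(beta)            # proportional to (1, 2, 3)
best = wsr_cost(beta, theta)       # equals (sum_k sqrt(beta_k))^2
```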


2.3 Lower bound and stopping condition 

Establishing a lower bound on the optimal value of the cost function (1) provides a
non-heuristic stopping criterion for the two-step block coordinate descent algorithm. Following
Yang et al. (2011), the lower bound is found as the minimum over $\theta$ of the dual form of (1):

$$
\underset{\theta \in \Theta}{\text{minimize}} \quad
1^\top \alpha - \frac{1}{2} (\alpha \circ y)^\top \Big(\sum_k \theta_k G_k\Big) (\alpha \circ y), \tag{5}
$$


where $\alpha$ is the vector of dual coefficients of the composite SVM problem. In Yang et al.
(2011), this bound is obtained as part of the cutting-plane method used for the optimization
of the kernel weights. The method proposed here takes a radically different approach as it
finds the point $\bar\theta$ where the minimum of (5) is attained as the solution of the elastic-net
constrained linear program:

$$
\underset{\theta \in \Theta}{\text{maximize}} \quad u^\top \theta, \tag{6}
$$

where

$$
u_k = (\alpha \circ y)^\top G_k\, (\alpha \circ y), \quad \forall k. \tag{7}
$$


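A novel, simple and efficient algorithm for the solution of (6) is provided in Section 4.
At each iteration, problem (6) is solved with $u = u^{(m)}$ computed from the current dual
coefficients $\alpha^{(m)}$, and its solution is denoted $\bar\theta^{(m)}$. The current values of the
objective function and of the lower bound are simply computed as

$$
O^{(m)} = 1^\top \alpha^{(m)} - \tfrac{1}{2}\, u^{(m)\top} \theta^{(m)}, \qquad
\underline{O}^{(m)} = 1^\top \alpha^{(m)} - \tfrac{1}{2}\, u^{(m)\top} \bar\theta^{(m)}. \tag{8}
$$

The two-step block coordinate descent algorithm terminates when an iterate with relative
gap $O^{(m)} / \underline{O}^{(m)} - 1 < \epsilon_{\mathrm{MKL}}$ is produced, which guarantees that the current value of the
objective function $O^{(m)}$ is at most $\epsilon_{\mathrm{MKL}}\, O^{(\infty)}$ away from the optimal value $O^{(\infty)}$.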
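The stopping test only needs a few inner products. The following is a small sketch (the function name `mkl_gap` and the toy numbers are made up for illustration) that evaluates the objective, the lower bound and the relative gap from $\alpha$, $u$, $\theta$ and $\bar\theta$.

```python
def mkl_gap(alpha, u, theta, theta_bar):
    """Return (objective, lower bound, relative gap) as used in the stopping test."""
    dual_sum = sum(alpha)                                        # 1^T alpha
    objective = dual_sum - 0.5 * sum(uk * t for uk, t in zip(u, theta))
    lower = dual_sum - 0.5 * sum(uk * t for uk, t in zip(u, theta_bar))
    return objective, lower, objective / lower - 1.0

# Toy numbers for eta = 1: theta is feasible and theta_bar puts all weight on the largest u_k.
objective, lower, gap = mkl_gap(alpha=[1.0, 1.0], u=[1.0, 2.0],
                                theta=[0.5, 0.5], theta_bar=[0.0, 1.0])
```

Because $\bar\theta$ maximizes $u^\top\theta$ over the feasible set, `lower` never exceeds `objective`, so the gap is nonnegative.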


3. Elastic-net constrained weighted sum of reciprocals


This whole section abstracts from the original MKL learning problem and focuses on the
solution of the following optimization problem:

$$
\begin{aligned}
\underset{\theta}{\text{minimize}} \quad & \sum_k \frac{\beta_k}{\theta_k} \\
\text{subj. to} \quad & \eta \|\theta\|_1 + (1-\eta) \|\theta\|_2^2 \leq 1, \\
& \theta \succeq 0,
\end{aligned} \tag{9}
$$

with $\beta \succ 0$ and $\eta \in [0,1]$. As mentioned before, a solution to this problem must lie in the
strictly positive orthant, $\theta \succ 0$. Furthermore, the solution must be attained at a point where
the elastic-net constraint is tight, i.e. $\eta\|\theta\|_1 + (1-\eta)\|\theta\|_2^2 = 1$. Aiming for a contradiction,
let us assume that $\theta$ minimizes (9) with $\eta\|\theta\|_1 + (1-\eta)\|\theta\|_2^2 = 1 - \epsilon$, $0 < \epsilon < 1$. The
point $\theta' = (1 + \frac{\epsilon}{3})\,\theta$ clearly decreases the cost function while still satisfying the elastic-net
constraint: $\eta\|\theta'\|_1 + (1-\eta)\|\theta'\|_2^2 = (1+\frac{\epsilon}{3})\,\eta\|\theta\|_1 + (1+\frac{\epsilon}{3})^2 (1-\eta)\|\theta\|_2^2 \leq (1+\epsilon)(1-\epsilon) < 1$.
This contradicts the original assumption that $\theta$ was a minimum for (9). Therefore, we can
search for the solution to (9) among the points in $\mathbb{R}_{++}^Q$ for which the elastic-net constraint
holds with equality. Please notice that in the remainder of this section $x$ and $y$ simply denote
vectors in $\mathbb{R}_{++}^Q$ (rather than training instances and labels as in the previous sections).


3.1 Re-scaled objective function 

As a preliminary step in attacking the problem (9), we introduce an equivalent optimization
problem. It is easy to verify that the norm

$$
s(x) = \tfrac{1}{2}\,\eta\,\|x\|_1 + \sqrt{\tfrac{1}{4}\,\eta^2 \|x\|_1^2 + (1-\eta)\,\|x\|_2^2} \tag{10}
$$

verifies $\eta\,\|x/s(x)\|_1 + (1-\eta)\,\|x/s(x)\|_2^2 = 1$, $\forall x \in \mathbb{R}^Q \setminus \{0\}$. As a result, the change of
variable $\theta = x/s(x)$ transforms the original problem (9) into the following equivalent one:

$$
\underset{x \in \mathbb{R}_{++}^Q}{\text{minimize}} \quad h(x) = s(x)\, g(x), \tag{11}
$$

where

$$
g(x) = \sum_k \frac{\beta_k}{x_k}. \tag{12}
$$
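The rescaling property of $s$ is easy to check numerically. The sketch below (helper names are illustrative) evaluates (10) in plain Python and rescales an arbitrary positive vector so that the elastic-net constraint holds with equality.

```python
import math

def elastic_net(theta, eta):
    """Left-hand side of the elastic-net constraint: eta*||theta||_1 + (1-eta)*||theta||_2^2."""
    return eta * sum(abs(t) for t in theta) + (1.0 - eta) * sum(t * t for t in theta)

def s_norm(x, eta):
    """Rescaling function s(x) of Eq. (10)."""
    n1 = 0.5 * eta * sum(abs(v) for v in x)
    return n1 + math.sqrt(n1 * n1 + (1.0 - eta) * sum(v * v for v in x))

def rescale(x, eta):
    """Map x to theta = x / s(x), which lies on the boundary of the elastic-net set."""
    sx = s_norm(x, eta)
    return [v / sx for v in x]

theta = rescale([0.3, 2.0, 5.5], eta=0.25)
residual = elastic_net(theta, 0.25) - 1.0     # zero up to rounding
```

For $\eta = 1$ the function reduces to $\|x\|_1$ and for $\eta = 0$ to $\|x\|_2$, as expected from the lasso and ridge limits of the constraint.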


This new optimization problem implicitly accounts for the elastic-net constraint by means
of the rescaling function $s$, which re-normalizes any $x \in \mathbb{R}_{++}^Q$ such that the vector $\theta =
x/s(x)$ satisfies the elastic-net constraint with equality. Our new task is therefore to find a
global minimum of $h$ in the positive orthant. Although $h$ is not a convex function, we can
prove a weaker result, pseudoconvexity, which is still very useful in practice because
critical points of pseudoconvex functions are also global minima (Cambini and Martein,
2008, theorem 3.2.5). In order to show that $h$ is pseudoconvex, the following theorem and
its corollary are introduced (proofs in Appendix A.1).


Theorem 1 Let $\mathcal{A} \subseteq \mathbb{R}^n$ be an open convex cone and $g, s : \mathcal{A} \to \mathbb{R}_{++}$ be differentiable
convex functions such that $s(cx) = c\,s(x)$ and $g(cx) = g(x)/c$ for all $c \in \mathbb{R}_{++}$ and $x \in \mathcal{A}$.
Their pointwise product $h(x) = s(x)\,g(x)$ is a pseudoconvex function in $\mathcal{A}$.


Corollary 2 Under the conditions of Theorem 1, all points $x = c\,x^*$ (with $c \in \mathbb{R}_{++}$ and $x^*$
satisfying $\nabla s(x^*) = -\nabla g(x^*)$) are global minima for the function $h$, where it takes value
$h(c\,x^*) = s^2(x^*) = g^2(x^*)$. If at least one of $s$ or $g$ is strictly convex, then $x^*$ is unique.


The functions $s$ and $g$ defined in (10) and (12) satisfy the requirements for Theorem 1.
In fact, they are both positive-valued differentiable functions in the positive orthant $\mathbb{R}_{++}^Q$
(which is an open convex cone) and they can be shown to be convex through some simple
calculus. As a result of Theorem 1, $h$ is a pseudoconvex function in $\mathbb{R}_{++}^Q$. Additionally, the
strict convexity of $g$ guarantees the uniqueness of $x^*$ defined in Corollary 2.


3.2 Iterative minimization algorithm 

The problem (11) can be minimized using the following novel iterative algorithm (see
footnote 1). Given the current iterate $x^{(m)}$, the next iterate is generated as:

$$
x_i^{(m+1)} = \sqrt{\frac{\beta_i}{q_i^{(m)}}}, \tag{13a}
$$

where

$$
q_i^{(m)} = \nabla_i s(x^{(m)}) = \left. \frac{\partial s(x)}{\partial x_i} \right|_{x = x^{(m)}}. \tag{13b}
$$

The algorithm is iterated until a stopping condition is met, at which point the last iterate $x$
is re-scaled to obtain the solution to the problem (9) as $\theta = x/s(x)$. The pseudocode of the
full algorithm, including the stopping condition that will be described in the following,
is reported in Algorithm 2.


1. Please notice that the superscript $m$ now refers to the current iteration within the algorithm for the
solution of (11) and is completely unrelated to the current iteration in the outer two-step block coordinate
descent algorithm for the solution of the original MKL problem.
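Algorithm 2: Solve elastic-net constrained weighted sum of reciprocals.

Function SolveElNetWSR($\beta$, $\theta^{(0)}$) is
    // Initialization
    $x \leftarrow \theta^{(0)}$;
    for $m \leftarrow 1$ to maximum number of iterations do
        // Compute cost
        $n_1 \leftarrow \frac{1}{2}\,\eta \sum_k x_k$;
        $r \leftarrow \sqrt{n_1^2 + (1-\eta) \sum_k x_k^2}$;
        $s \leftarrow n_1 + r$;
        $g \leftarrow \sum_k \beta_k / x_k$;
        // Check convergence
        if $m > 1$ and $s/g - 1 < \epsilon_{\mathrm{wsr}}$ then break;
        // Update iterate
        $q \leftarrow [(\frac{1}{2}\eta)\, s\, 1_Q + (1-\eta)\, x] / r$;
        for $k \leftarrow 1$ to $Q$ do $x_k \leftarrow \sqrt{\beta_k / q_k}$;
    end
    return $x/s$;
end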
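Algorithm 2 translates almost line by line into Python. The following is a minimal sketch, not the author's code: the default tolerance and iteration cap are assumptions, and $s$ is recomputed before returning so that the result is correctly scaled even if the loop ends on the iteration cap.

```python
import math

def s_norm(x, eta):
    """Rescaling function s(x) of Eq. (10) for x in the positive orthant."""
    n1 = 0.5 * eta * sum(x)
    return n1 + math.sqrt(n1 * n1 + (1.0 - eta) * sum(v * v for v in x))

def solve_elnet_wsr(beta, theta0, eta, eps_wsr=1e-12, max_iter=1000):
    """Minimize sum_k beta_k / theta_k subject to the elastic-net constraint."""
    x = list(theta0)
    for m in range(1, max_iter + 1):
        # Compute cost
        n1 = 0.5 * eta * sum(x)
        r = math.sqrt(n1 * n1 + (1.0 - eta) * sum(v * v for v in x))
        s = n1 + r
        g = sum(b / v for b, v in zip(beta, x))
        # Check convergence (only meaningful once x comes from an update)
        if m > 1 and s / g - 1.0 < eps_wsr:
            break
        # Update iterate: q is the gradient of s at x, then x_k = sqrt(beta_k / q_k)
        q = [(0.5 * eta * s + (1.0 - eta) * v) / r for v in x]
        x = [math.sqrt(b / qk) for b, qk in zip(beta, q)]
    sx = s_norm(x, eta)          # recomputed in case the loop stopped on max_iter
    return [v / sx for v in x]

theta = solve_elnet_wsr([1.0, 4.0, 9.0], [0.5, 0.5, 0.5], eta=0.5)
```

In the two limiting cases the output can be compared with closed forms: $\theta \propto \sqrt{\beta}$ for $\eta = 1$ and $\theta \propto \beta^{1/3}$ for $\eta = 0$.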


While a full proof of the convergence of the algorithm is provided in Section 3.3, the
intuition behind it is sketched here. For ease of notation, we will hereafter drop the iteration
superscript and refer to the current iterate as $w = x^{(m)}$ and to the next one as $z = x^{(m+1)}$.
The new iterate $z$ generated from (13) can be interpreted as the solution to the problem:

$$
\begin{aligned}
\underset{x}{\text{minimize}} \quad & g(x) \\
\text{subj. to} \quad & q^\top x = p,
\end{aligned} \tag{14}
$$

where $p = \sum_j \sqrt{\beta_j\, q_j}$. In other words, the new iterate is generated by minimizing the function
$g$ on a hyperplane which is perpendicular to the gradient of $s$ at $w$. The specific choice of
the offset constant, i.e. $p$ in (14), has an interesting geometrical interpretation. Because
the functions $s$ and $g$ satisfy the requirements for Theorem 1, for any positive $c$ the point
$y = c\,w$ is such that $s(y)\,g(y) = s(w)\,g(w)$ and also that $q^\top x = \nabla s(y)^\top x \leq s(x)$ $\forall x$ (see
Theorem 4 below). Choosing $c$ such that $s(y) = c\,s(w) = p$ and substituting (13a) in (12), it
is easy to show that the hyperplane $q^\top x = p$ has the following properties:

\begin{align}
q &= \nabla s(y) = -\nabla g(z), \tag{15a} \\
p &= s(y) \leq s(x) \quad \text{and} \tag{15b} \\
p &= g(z) \leq g(x) \qquad \forall x : q^\top x = p. \tag{15c}
\end{align}

In other words, this hyperplane is externally tangent to the level sets of $s$ and $g$ of the same
value, $p$. Asymptotically, the algorithm finds the hyperplane that is tangent to the two level
sets at the same point. Although there is no guarantee that each step decreases both $g$ and
$s$, the next section will show that their product $h$ decreases monotonically at each step and
that the algorithm, in fact, converges towards the solution.


3.3 Convergence analysis 

A fixed point for the iterative map (13) is the point $x^*$ satisfying the conditions of
Corollary 2. In fact, substituting $q_i^* = \nabla_i s(x^*) = -\nabla_i g(x^*) = \beta_i / (x_i^*)^2$ in the iterate update
(13a) makes it an identity. By Corollary 2, this fixed point is a global minimum for $h$ and,
therefore, a solution for (11).

To show that the algorithm (13) can be used to solve (11), it remains to be proven that
the iterative map (13) converges to its fixed point $x^*$ for all starting points $x^{(0)} \in \mathbb{R}_{++}^Q$. To do
so, we will make use of convergence results of descent algorithms (Zangwill, 1969; Meyer,
1976; Bertsekas, 1999; Luenberger and Ye, 2008) and in particular of Zangwill's Global
Convergence Theorem (Luenberger and Ye, 2008, p. 205), restated here for convenience.


Theorem 3 (Global Convergence Theorem) Let $\mathbf{A}$ be an algorithm on $\mathcal{A}$, and suppose
that, given $x^{(0)}$, the sequence $\{x^{(m)}\}_{m=0}^{\infty}$ is generated satisfying $x^{(m+1)} \in \mathbf{A}(x^{(m)})$. Let a
solution set $\Gamma \subset \mathcal{A}$ be given, and suppose:

1. all points $x^{(m)}$ are contained in a compact set $S \subset \mathcal{A}$,

2. there is a continuous function $\zeta$ on $\mathcal{A}$ such that:

(a) if $x \notin \Gamma$, then $\zeta(z) < \zeta(x)$ for all $z \in \mathbf{A}(x)$,

(b) if $x \in \Gamma$, then $\zeta(z) \leq \zeta(x)$ for all $z \in \mathbf{A}(x)$,

3. the mapping $\mathbf{A}$ is closed at points outside $\Gamma$.


Then the limit of any convergent subsequence of $\{x^{(m)}\}$ is a solution.


The following will show that Theorem 3 applies to the mapping $x^{(m+1)} = \mathbf{A}(x^{(m)})$
corresponding to (13). This mapping is defined in $\mathcal{A} = \mathbb{R}_{++}^Q$ and has solution set $\Gamma = \{x^*\}$.

Since $s$ is a differentiable convex function in the open convex set $\mathbb{R}_{++}^Q$, it is actually
continuously differentiable in $\mathbb{R}_{++}^Q$ (Rockafellar, 1970, Corollary 25.5.1). The specific choice
of $s$ in (10) is such that $\nabla s$ is also strictly positive and, therefore, the iteration (13) defines
a continuous function (point-to-point mapping) from $x^{(m)}$ to $x^{(m+1)}$. Since for a point-to-point
mapping continuity implies closedness (Luenberger and Ye, 2008, p. 206), the third
condition of Zangwill's theorem is satisfied.

As a first step towards verifying the second condition, the following theorem is introduced
(proof provided in Appendix A.2).


Theorem 4 Given a norm $s : \mathbb{R}^n \to \mathbb{R}_+$ of the form $s(x) = d_0 \|x\|_1 + \sqrt{d_1^2 \|x\|_1^2 + d_2^2 \|x\|_2^2}$
with $d_0, d_1, d_2 \geq 0$, the following property holds:

$$
\nabla s(y)^\top x \;\leq\; s(x) \;\leq\; \sqrt{x^\top \Lambda_y\, x} \qquad \forall x, y \in \mathbb{R}_{++}^n, \tag{16}
$$

where $\Lambda_y$ is a diagonal matrix whose $i$-th diagonal element is $s(y)\,\nabla_i s(y) / y_i$.
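The two-sided bound of Theorem 4 can be spot-checked numerically for the specific norm (10), which corresponds to $d_0 = d_1 = \eta/2$ and $d_2 = \sqrt{1-\eta}$. The sketch below is only a sanity check on sample points, not a proof; the helper names are illustrative.

```python
import math

def s_and_grad(x, eta):
    """Return s(x) from Eq. (10) and its gradient for x in the positive orthant."""
    n1 = 0.5 * eta * sum(x)
    r = math.sqrt(n1 * n1 + (1.0 - eta) * sum(v * v for v in x))
    s = n1 + r
    return s, [(0.5 * eta * s + (1.0 - eta) * v) / r for v in x]

def theorem4_bounds(x, y, eta):
    """Return (grad s(y)^T x, s(x), sqrt(x^T Lambda_y x)) for positive vectors x, y."""
    sy, q = s_and_grad(y, eta)
    sx, _ = s_and_grad(x, eta)
    lower = sum(qi * xi for qi, xi in zip(q, x))
    upper = math.sqrt(sum(sy * qi / yi * xi * xi for qi, yi, xi in zip(q, y, x)))
    return lower, sx, upper

lower, middle, upper = theorem4_bounds([0.5, 2.0, 1.0], [1.5, 0.3, 4.0], eta=0.4)
```

When `x` and `y` coincide the three returned values agree, which is the tangency exploited by the iteration.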
We can now write the following chain of inequalities showing that $h$ is non-increasing at
each step:

$$
h(x^{(m+1)}) = h(z) \leq s^2(z) \leq z^\top \Lambda_y\, z
= \sum_i \frac{\beta_i}{q_i}\, s(y)\, \frac{q_i}{y_i}
= h(y) = h(w) = h(x^{(m)}), \tag{17}
$$

where the first inequality follows from (15) while the second one from Theorem 4.
Unfortunately, the fact that $h$ is constant along rays out of the origin makes it unsuitable as function
$\zeta$ for Theorem 3 (the strict inequality in condition 2.(a) is violated for points $x = c\,x^*$ with
$c \in \mathbb{R}_{++} \setminus \{1\}$). Instead, we consider the function

$$
\zeta(x) = 2\,h(x) + [s(x) - g(x)]^2 = g^2(x) + s^2(x), \tag{18}
$$

for which the following inequality can be readily obtained from (15), (17), and (18):

$$
\zeta(z) = g^2(z) + s^2(z) \leq 2\,s^2(z) \leq 2\,h(w) \leq \zeta(w). \tag{19}
$$

Importantly, as prescribed by 2.(a), the expression (19) holds with equality only if the
starting point of the iteration ($w$ in our case) is in the solution set $\Gamma$. This can be shown
by first noticing that $\zeta(z) = \zeta(w)$ implies $g(z) = s(z) = g(w) = s(w)$. From the definition
of $y$, we see that $s(w) = g(z) \Rightarrow y = w$. Since the restriction of $g$ along $q^\top x = p$ is strictly
convex, the inequality in (15c) holds as equality only at the minimum, i.e. $g(z) = g(y) \Rightarrow
z = y$. Putting these together, we obtain that $z = w$, which substituted in (15a) finally
yields $\nabla s(w) = -\nabla g(w)$, the condition defining the fixed point $x^*$. This proves that the
second condition of Zangwill's theorem is also satisfied.
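The monotonic behaviour of $h$ and $\zeta$ established above can be observed numerically. The following sketch (illustrative helper names, arbitrary starting point) records both quantities along the iteration (13) so that the chains (17) and (19) can be inspected.

```python
import math

def s_and_grad(x, eta):
    """Return s(x) from Eq. (10) and its gradient for x in the positive orthant."""
    n1 = 0.5 * eta * sum(x)
    r = math.sqrt(n1 * n1 + (1.0 - eta) * sum(v * v for v in x))
    s = n1 + r
    return s, [(0.5 * eta * s + (1.0 - eta) * v) / r for v in x]

def descent_trace(beta, x0, eta, steps):
    """Apply update (13a) repeatedly, recording h = s*g and zeta = s^2 + g^2."""
    x, h_vals, zeta_vals = list(x0), [], []
    for _ in range(steps + 1):
        s, q = s_and_grad(x, eta)
        g = sum(b / v for b, v in zip(beta, x))
        h_vals.append(s * g)
        zeta_vals.append(s * s + g * g)
        x = [math.sqrt(b / qk) for b, qk in zip(beta, q)]
    return h_vals, zeta_vals

h_vals, zeta_vals = descent_trace([1.0, 4.0, 9.0], [3.0, 0.2, 1.0], eta=0.3, steps=30)
```

Both recorded sequences are non-increasing up to rounding, even though the starting point is deliberately far from the solution.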

Through some simple algebra, it is easy to show that $\nabla_i s(x) \leq 1$ $\forall x, i$. This, together
with (10), (13a) and (17), leads to $\sqrt{\beta_i} \leq x_i^{(m+1)} \leq s(x^{(m+1)}) \leq \sqrt{h(x^{(m)})} \leq \sqrt{h(x^{(0)})}$. As
a result, all the points of the sequence $\{x^{(m)}\}$ (with the immaterial possible exception of $x^{(0)}$)
are contained in $[\min_j \sqrt{\beta_j},\, \sqrt{h(x^{(0)})}]^Q$, which is a closed and bounded subset of $\mathbb{R}_{++}^Q$, as
prescribed by the first condition of the theorem.

In conclusion, we have proven that the algorithm defined by the iteration (13) satisfies
the conditions of Zangwill's theorem. Also, because the solution set $\Gamma$ consists of a single
point $x^*$, the sequence $\{x^{(m)}\}$ converges to $x^*$ (Luenberger and Ye, 2008, p. 206).


3.4 Stopping condition 


We now establish a lower bound on the optimal value of $h$, which will be used to provide
a non-heuristic stopping criterion for the iterative algorithm in (13). Given the solution
$x^*$ and the new iterate $x^{(m+1)}$ obtained as described in Section 3.2, we observe that, since
$q$ and $x^*$ lie in the (strictly) positive orthant, there always exists $c \in \mathbb{R}_{++}$ such that
$q^\top (c\,x^*) = p$, with $p$ as in Section 3.2. Therefore, (15c) implies $p \leq g(c\,x^*)$ and Theorem 4
yields $p = q^\top (c\,x^*) \leq s(c\,x^*)$. Combining these two inequalities gives $p^2 \leq g(c\,x^*)\, s(c\,x^*)$,
which can be rewritten as $g^2(x^{(m+1)}) \leq h(x^*)$, where the equality only holds at the solution
$x^*$. As a result, $h(x^{(m+1)}) - g^2(x^{(m+1)})$ bounds how suboptimal the iterate $x^{(m+1)}$ is, even without
knowing the exact value of $h(x^*)$. The following stopping condition guarantees a predefined
relative accuracy $\epsilon_{\mathrm{wsr}} > 0$:

$$
\frac{h(x^{(m+1)}) - g^2(x^{(m+1)})}{g^2(x^{(m+1)})}
= \frac{s(x^{(m+1)})}{g(x^{(m+1)})} - 1 \leq \epsilon_{\mathrm{wsr}}. \tag{20}
$$

The algorithm terminates after an $\epsilon_{\mathrm{wsr}}$-suboptimal iterate is produced, i.e. when (20) is
satisfied, which guarantees that $h(x^{(m+1)}) - h(x^*) \leq \epsilon_{\mathrm{wsr}}\, h(x^*)$.
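The sandwich $g^2(x^{(m+1)}) \leq h(x^*) \leq h(x^{(m+1)})$ behind this stopping rule can be traced numerically. The sketch below records, after each update, the lower bound, the current cost and the relative gap; helper names and the test problem are illustrative.

```python
import math

def s_and_grad(x, eta):
    """Return s(x) from Eq. (10) and its gradient for x in the positive orthant."""
    n1 = 0.5 * eta * sum(x)
    r = math.sqrt(n1 * n1 + (1.0 - eta) * sum(v * v for v in x))
    s = n1 + r
    return s, [(0.5 * eta * s + (1.0 - eta) * v) / r for v in x]

def bound_trace(beta, x0, eta, steps):
    """After each update (13a), record (g^2, h, s/g - 1) at the new iterate."""
    x, trace = list(x0), []
    for _ in range(steps):
        _, q = s_and_grad(x, eta)
        x = [math.sqrt(b / qk) for b, qk in zip(beta, q)]
        s, _ = s_and_grad(x, eta)
        g = sum(b / v for b, v in zip(beta, x))
        trace.append((g * g, s * g, s / g - 1.0))
    return trace

trace = bound_trace([1.0, 4.0, 9.0], [1.0, 1.0, 1.0], eta=0.3, steps=100)
final_lower, final_upper, final_gap = trace[-1]
```

The third entry of each record is exactly the quantity compared with $\epsilon_{\mathrm{wsr}}$ in Algorithm 2.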


3.5 Alternative approaches 


This section presents a brief overview of alternative approaches that were devised by the 
author in the process of creating and improving the main method presented above. They 
are reported here because they may be advantageous in specific situations and for some 
values of the parameters. 

An approach to minimizing (9), which works particularly well when $\eta$ is small, is by
using the alternative update:

$$
x_i^{(m+1)} = \left( \frac{\beta_i\, x_i^{(m)}}{q_i^{(m)}} \right)^{1/3} \tag{21}
$$

instead of (13a). For an appropriate choice of $p' > 0$, this iterate is the solution to the
problem:

$$
\begin{aligned}
\underset{x}{\text{minimize}} \quad & g(x) \\
\text{subj. to} \quad & x^\top \Lambda\, x = p',
\end{aligned} \tag{22}
$$

where $\Lambda$ is the diagonal matrix of Theorem 4 evaluated at the current iterate. This problem
is analogous to (14) but uses a quadratic constraint instead of a linear one. The
iterative map defined by (21) has the same fixed point as the map (13a) and a convergence
proof can be obtained using arguments similar to those in Section 3.3. In simulations, the
convergence rate of the update rule (21) appears to be marginally better than (13a) for
small values of $\eta$ (less than approximately 0.25) and significantly worse otherwise. For this
reason, it may be advantageous to use (13a) when $\eta > 0.25$ and alternate between (21) and
(13a) when $\eta < 0.25$.
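The relation between the two updates can be explored with a short sketch. It assumes that the alternative update (21) reads $x_i \leftarrow (\beta_i x_i / q_i)^{1/3}$, which is what the stationarity condition of the quadratically constrained problem (22) gives when the constraint matrix has diagonal entries proportional to $q_i / x_i$; the function names are illustrative.

```python
import math

def s_and_grad(x, eta):
    """Return s(x) from Eq. (10) and its gradient for x in the positive orthant."""
    n1 = 0.5 * eta * sum(x)
    r = math.sqrt(n1 * n1 + (1.0 - eta) * sum(v * v for v in x))
    s = n1 + r
    return s, [(0.5 * eta * s + (1.0 - eta) * v) / r for v in x]

def update_13a(beta, x, eta):
    """Main update: x_i <- sqrt(beta_i / q_i)."""
    _, q = s_and_grad(x, eta)
    return [math.sqrt(b / qk) for b, qk in zip(beta, q)]

def update_21(beta, x, eta):
    """Alternative update (assumed form): x_i <- (beta_i * x_i / q_i)^(1/3)."""
    _, q = s_and_grad(x, eta)
    return [(b * v / qk) ** (1.0 / 3.0) for b, v, qk in zip(beta, x, q)]

def rescale(x, eta):
    """Map x to theta = x / s(x)."""
    s, _ = s_and_grad(x, eta)
    return [v / s for v in x]

# With eta = 0 a single alternative step already gives theta proportional to beta^(1/3).
theta_one_step = rescale(update_21([1.0, 8.0, 27.0], [1.0, 1.0, 1.0], 0.0), 0.0)
```

For $\eta = 0$ the constraint is a pure $L_2$ ball and the exact minimizer is proportional to $\beta^{1/3}$, so the one-step result is consistent with the remark that this update works well for small $\eta$.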

An algorithm for the solution of the problem (9) using a majorization-minimization
(MM) procedure was presented in (Citi, 2015). Briefly, the algorithm is similar to coordinate
descent but at each step, instead of performing a full line search to minimize $h$ as a function
of one of the optimization variables, it reduces it by minimizing a carefully designed
surrogate function, called a majorizer, which can be solved in closed form. The number
of iterations required to obtain a given accuracy is comparable to that of Algorithm 2 but
each iteration requires roughly four times as many flops.


4. Elastic-net constrained linear program 

This section introduces an efficient algorithm for finding the solution $\bar\theta$ of the elastic-net
constrained linear program:

$$
\begin{aligned}
\underset{\theta}{\text{maximize}} \quad & u^\top \theta \\
\text{subj. to} \quad & \eta \|\theta\|_1 + (1-\eta) \|\theta\|_2^2 \leq 1, \\
& \theta \succeq 0,
\end{aligned} \tag{23}
$$

with $u \succeq 0$, $u \neq 0_Q$ and $\eta \in [0,1]$. As shown in Section 2.3, a solution to this problem
provides a lower bound on the optimal value of the original MKL cost function (1).
4.1 Algorithm

In the special case $\eta = 1$, the (possibly nonunique) straightforward solution to the problem
is the vector $e_k$, where $k$ is such that $u_k = \max_j u_j$. When $\eta < 1$, simple algebra shows
that points in $\mathbb{R}_+^Q$ satisfy the elastic-net constraint if and only if they also belong to the
hyper-sphere with centre $c$ and radius $r$, where

\begin{align}
d &= \eta / (2 - 2\eta), \tag{24} \\
c &= -d\, 1_Q, \tag{25} \\
r &= \sqrt{Q d^2 + 2d + 1}. \tag{26}
\end{align}
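The algebra behind this equivalence amounts to completing the square. For $\theta \succeq 0$ (so that $\|\theta\|_1 = 1_Q^\top \theta$) and $\eta < 1$, dividing the constraint by $1 - \eta$ and using $2d = \eta/(1-\eta)$, hence $1/(1-\eta) = 1 + 2d$, gives:

```latex
\begin{align*}
\eta\, 1_Q^\top \theta + (1-\eta)\, \|\theta\|_2^2 \leq 1
&\iff \|\theta\|_2^2 + 2d\, 1_Q^\top \theta \leq 1 + 2d \\
&\iff \|\theta\|_2^2 + 2d\, 1_Q^\top \theta + Q d^2 \leq Q d^2 + 2d + 1 \\
&\iff \|\theta + d\, 1_Q\|_2^2 \leq r^2 .
\end{align*}
```

The last line is the ball of radius $r$ centred at $-d\,1_Q$, restricted to the nonnegative orthant.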

Therefore, the problem (23) is equivalent to:

$$
\underset{\theta \in \mathbb{R}_+^Q,\; \|\theta - c\|_2^2 \leq r^2}{\text{maximize}} \quad u^\top \theta. \tag{27}
$$

Let us now consider the point $q$:

$$
q = r\, u / \|u\|_2 + c, \tag{28}
$$

which is the point of the hyper-sphere which is farthest away in the direction of $u$. If this
point is also in $\mathbb{R}_+^Q$, then $\bar\theta = q$ is trivially a solution for the optimization problem (27). If this
is not the case, the important property that $q_k < 0 \Rightarrow \bar\theta_k = 0$ (of which a proof is provided
in Section 4.2) suggests a method to incrementally prune away coordinate directions that
are guaranteed to be zero in the optimal solution $\bar\theta$. At each iteration $m$, the algorithm
keeps track of the set $Z^{(m)}$ of indices for which it has already been established that the
corresponding element of $\bar\theta$ is null, i.e. $k \in Z^{(m)} \Rightarrow \bar\theta_k = 0$. The set $Z$ is initialized to the
empty set $\emptyset$ at the beginning of the algorithm and grows monotonically at each iteration.
We denote as $|Z|$ the cardinality of $Z$, as $\bar{Z}$ its complement and as $u_{\bar{Z}}$ the projection of
$u$ on the $(Q - |Z|)$-dimensional subspace spanned by coordinate directions corresponding to
indices in $\bar{Z}$. The algorithm generates the next iterate according to:

$$
q_k^{(m)} =
\begin{cases}
r^{(m)}\, u_k / \|u_{\bar{Z}^{(m)}}\|_2 - d, & \text{if } k \in \bar{Z}^{(m)}, \\
0, & \text{if } k \in Z^{(m)}.
\end{cases} \tag{29}
$$

This is the point of the $|\bar{Z}^{(m)}|$-dimensional disc of radius $r^{(m)} = \sqrt{|\bar{Z}^{(m)}|\, d^2 + 2d + 1}$ and
centre $c_{\bar{Z}^{(m)}}$ which is farthest away in the direction of $u$. If any of the elements of $q^{(m)}$ is
negative, their indices are added to $Z$ and the algorithm starts a new iteration, otherwise
the algorithm ends and the last iterate is returned as the solution $\bar\theta$ to the elastic-net
constrained linear program (23). The detailed algorithm is reported in Algorithm 3.


4.2 Convergence analysis 

The fact that the greedy algorithm presented in Section 4.1 finds the global solution in a
finite number of iterations stems from the property that if the algorithm produces an iterate
with a negative component, the corresponding element of the solution must be zero:

$$
q_k^{(m)} < 0 \;\Rightarrow\; \bar\theta_k = 0, \qquad \forall k, m. \tag{30}
$$
Algorithm 3: Solve elastic-net constrained linear program.

Function SolveElNetLP($u$) is
    // Initialization
    $Z \leftarrow \emptyset$;
    $d \leftarrow \eta / (2 - 2\eta)$;
    do
        // Main loop
        $r \leftarrow \sqrt{|\bar{Z}|\, d^2 + 2d + 1}$;
        $q_{\bar{Z}} \leftarrow r\, u_{\bar{Z}} / \|u_{\bar{Z}}\|_2 - d$;
        $N \leftarrow \{k \mid q_k < 0\}$;
        $q_N \leftarrow 0$;
        $Z \leftarrow Z \cup N$;
    while $N \neq \emptyset$;
    return $q$;
end
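A plain-Python sketch of Algorithm 3 is given below. It tracks the complement of $Z$ as a set of active indices and, as an addition not present in the pseudocode, handles the lasso case $\eta = 1$ separately using the solution $e_k$ described in Section 4.1; the function name is illustrative.

```python
import math

def solve_elnet_lp(u, eta):
    """Maximize u^T theta subject to the elastic-net constraint and theta >= 0."""
    Q = len(u)
    if eta >= 1.0:
        # Lasso case: all the weight goes to (one of) the largest entries of u.
        k = max(range(Q), key=lambda j: u[j])
        return [1.0 if j == k else 0.0 for j in range(Q)]
    d = eta / (2.0 - 2.0 * eta)
    active = set(range(Q))            # complement of the pruned set Z
    q = [0.0] * Q
    while True:
        r = math.sqrt(len(active) * d * d + 2.0 * d + 1.0)
        norm = math.sqrt(sum(u[k] ** 2 for k in active))
        for k in active:
            q[k] = r * u[k] / norm - d
        negative = {k for k in active if q[k] < 0.0}
        if not negative:
            return q
        for k in negative:            # prune: these weights are zero at the optimum
            q[k] = 0.0
        active -= negative

theta_bar = solve_elnet_lp([1.0, 0.1, 0.0], eta=0.5)
```

In this example the second and third coordinates are pruned in the first pass, and the remaining one-dimensional problem has radius $d + 1$, which gives a unit weight on the first kernel.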


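Aiming for a contradiction, let us assume that $\bar\theta$, with $\bar\theta_k > 0$, is a solution to (23) and that
at some point the algorithm produces the iterate $q^{(m)}$ with $q_k^{(m)} < 0$. For conciseness, we
denote $q^{(m)}$ simply as $q$, $Z^{(m)}$ as $Z$, $r^{(m)}$ as $\rho$, and $u_{\bar{Z}^{(m)}}$ as $w$, within this section. From
(29), it follows that $q_k < 0$ implies $w_k < \|w\|\, d / \rho$. Let us consider the point

$$
\theta' = \bar\theta - \epsilon\, e_k + \delta\, w, \qquad \text{with } 0 < \epsilon \leq \bar\theta_k \text{ and } \delta = \frac{d\,\epsilon}{\rho\, \|w\|}, \tag{31}
$$

and show that it satisfies the constraints of (27). Because $\epsilon \leq \bar\theta_k$, $\delta > 0$, and $w \succeq 0$, then
$\bar\theta \succeq 0 \Rightarrow \theta' \succeq 0$. It is now sufficient to show that $\|\bar\theta - c\|^2 \leq r^2 \Rightarrow \|\theta' - c\|^2 \leq r^2$:

\begin{align}
\|\theta' - c\|^2
&= \|\bar\theta - c\|^2 + \|\delta w - \epsilon e_k\|^2 + 2\, (\delta w - \epsilon e_k)^\top (\bar\theta - c) \notag \\
&= \|\bar\theta - c\|^2 + \frac{d^2 \epsilon^2}{\rho^2} + \epsilon^2 - 2 \delta \epsilon w_k
   + \frac{2 d \epsilon}{\rho \|w\|}\, w^\top (\bar\theta_{\bar{Z}} - c_{\bar{Z}}) - 2 \epsilon \bar\theta_k - 2 d \epsilon \notag \\
&\leq \|\bar\theta - c\|^2 + (2 \epsilon^2 - 2 \epsilon \bar\theta_k) - 2 \delta \epsilon w_k
   + 2 d \epsilon \left[ \frac{\|w\|\, \|\bar\theta_{\bar{Z}} - c_{\bar{Z}}\|}{\|w\|\, \rho} - 1 \right] \notag \\
&\leq \|\bar\theta - c\|^2 + 2 d \epsilon \left[ \frac{\sqrt{\|\bar\theta - c\|^2 - |Z|\, d^2}}{\rho} - 1 \right]
\leq \|\bar\theta - c\|^2. \tag{32}
\end{align}

The last line uses $\bar\theta_j = 0$ for $j \in Z$, so that $\|\bar\theta_{\bar{Z}} - c_{\bar{Z}}\|^2 = \|\bar\theta - c\|^2 - |Z|\, d^2 \leq r^2 - |Z|\, d^2 = \rho^2$.
This proves that $\theta'$ is a feasible point for (27). Because $u^\top \theta' = u^\top \bar\theta - \epsilon u_k + \delta u^\top w =
u^\top \bar\theta - \epsilon w_k + \frac{d \epsilon}{\rho \|w\|} \|w\|^2 > u^\top \bar\theta$, the feasible point $\theta'$ improves over $\bar\theta$, which therefore cannot
be a solution. This contradiction proves (30).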


5. Conclusions 

This technical report focuses on an algorithm for the minimization of a positive-weighted
sum of reciprocals subject to elastic-net constraints. This algorithm, explained in detail
in Sections 3.1-3.4, can be used to optimize the kernel weights within a two-step block
coordinate descent alternating between the optimization of the SVM classifiers and the
optimization of the kernel weights. Preliminary tests (not reported) of the computational
cost of the algorithm suggest that it compares very favourably to existing and alternative
approaches. Finally, because it does not depend on external libraries, it has a wide
applicability and can be readily included in existing open-source machine learning libraries.


Appendix A. Proofs of theorems 

The proofs of the theorems given in the text are reported in this appendix in the form
of structured proofs as advocated by Leslie Lamport (2012). Each assertion follows from
previously stated facts, which are explicitly named to tell the reader exactly which ones are
being used at each step.


A.1 Proofs of Theorem 1 and Corollary 2

Theorem 1 Let $\mathcal{A} \subseteq \mathbb{R}^n$ be an open convex cone and $g, s : \mathcal{A} \to \mathbb{R}_{++}$ be differentiable
convex functions such that $s(cx) = c\,s(x)$ and $g(cx) = g(x)/c$ for all $c \in \mathbb{R}_{++}$ and $x \in \mathcal{A}$.
Their pointwise product $h(x) = s(x)\,g(x)$ is a pseudoconvex function in $\mathcal{A}$.


Proof


1. To show that the differentiable function $h : \mathcal{A} \to \mathbb{R}_{++}$ defined in an open convex set is
pseudoconvex, it suffices to assume for the remainder of this proof that:

1.1. $y, z \in \mathcal{A}$,

1.2. $h(z) < h(y)$,

and prove that $\nabla h(y)^\top (z - y) < 0$.

Proof: By the definition of pseudoconvex function (Cambini and Martein, 2008,
definition 3.2.1).


2. $\forall x \in \mathcal{A}$: $\nabla g(x)^\top x = -g(x)$.

Proof: By differentiating $g(cx) = g(x)/c$ w.r.t. $c$ and evaluating it for $c = 1$.


3. $\forall x \in \mathcal{A}$: $\nabla s(x)^\top x = s(x)$.

Proof: By differentiating $s(cx) = c\,s(x)$ w.r.t. $c$ and evaluating it for $c = 1$.

4. $\forall x \in \mathcal{A}$: $\nabla h(x)^\top x = g(x)\,\nabla s(x)^\top x + s(x)\,\nabla g(x)^\top x = 0$.

Proof: Follows directly from 2 and 3.


5. Given $z$ and $y$ as in 1, $\exists c \in \mathbb{R}_{++}$ such that the point $z' = c\,z$ satisfies $h(z') = h(z)$ and
$s(z') = s(y)$.

Proof: For any positive $c$ the corresponding $z'$ is in $\mathcal{A}$ (because $\mathcal{A}$ is a cone) and satisfies
the first condition: $h(z') = s(cz)\,g(cz) = c\,s(z)\,g(z)/c = h(z)$. We choose $c = s(y)/s(z)$,
which also satisfies the second condition: $s(z') = s(y)/s(z)\; s(z) = s(y)$.


6. $\forall x, x' \in \mathcal{A}$: $\nabla s(x)^\top x' \leq s(x')$.

Proof: The first-order conditions for convexity (Boyd and Vandenberghe, 2009, ch. 3.1.3)
imply $s(x') \geq s(x) + \nabla s(x)^\top (x' - x)$. Substituting 3 and rearranging yields 6.


7. $\forall x, x' \in \mathcal{A}$: $\nabla g(x)^\top x' \leq g(x') - 2\,g(x)$.

Proof: The first-order conditions for convexity imply $g(x') \geq g(x) + \nabla g(x)^\top (x' - x)$.
Substituting 2 and rearranging yields 7.


8. $\nabla h(y)^\top z' < 0$.

Proof: By 6, 7, 5 and 1.2 we have:

\begin{align*}
\nabla h(y)^\top z' &= g(y)\, \nabla s(y)^\top z' + s(y)\, \nabla g(y)^\top z' \\
&\leq g(y)\, s(z') + [s(y)\, g(z') - 2\, s(y)\, g(y)] \\
&= h(y) + h(z') - 2\, h(y) \\
&= h(z') - h(y) = h(z) - h(y) < 0.
\end{align*}


9. Q.E.D.

Proof: By 8, 5 and 4 we have:

$c\, \nabla h(y)^\top z < 0 \;\Rightarrow\; \nabla h(y)^\top z = \nabla h(y)^\top (z - y) < 0$.
By 1, the latter proves the theorem.


Corollary 2 Under the conditions of Theorem 1, all points $x = c\,x^*$ (with $c \in \mathbb{R}_{++}$ and $x^*$
satisfying $\nabla s(x^*) = -\nabla g(x^*)$) are global minima for the function $h$, where it takes value
$h(c\,x^*) = s^2(x^*) = g^2(x^*)$. If at least one of $s$ or $g$ is strictly convex, then $x^*$ is unique.


Proof


10. $\nabla s(x^*) = -\nabla g(x^*) \Rightarrow s(x^*) = g(x^*)$.

Proof: Follows immediately from the statements 2 and 3 of the proof of Theorem 1.


11. $x^*$ is a critical point for $h$.

Proof: From the condition $\nabla s(x^*) = -\nabla g(x^*)$ and from statement 10, $\nabla h(x^*) =
g(x^*)\, \nabla s(x^*) + s(x^*)\, \nabla g(x^*) = 0$.


12. $x^*$ is a global minimum of $h$.

Proof: Because $x^*$ is a critical point (statement 11) of a pseudoconvex function
(Theorem 1), it is also a global minimum (Cambini and Martein, 2008, theorem 3.2.5).

13. If at least one of $s$ or $g$ is strictly convex, then $x^*$ is unique.

Proof: Aiming for a contradiction, let us assume that there is a point $\tilde{x} \in \mathcal{A} \setminus \{x^*\}$ such
that $\nabla s(\tilde{x}) = -\nabla g(\tilde{x})$. By using the same reasoning as in 10, this implies $s(\tilde{x}) = g(\tilde{x})$.
Without loss of generality, let us assume that $s(x^*) \geq s(\tilde{x})$ and that $s$ is strictly convex.
From the first-order conditions for (strict) convexity, we obtain:

\begin{align}
s(\tilde{x}) > s(x^*) + \nabla s(x^*)^\top (\tilde{x} - x^*) \;&\Rightarrow\; \nabla s(x^*)^\top (\tilde{x} - x^*) < 0, \tag{33} \\
g(\tilde{x}) \geq g(x^*) + \nabla g(x^*)^\top (\tilde{x} - x^*) \;&\Rightarrow\; \nabla s(x^*)^\top (\tilde{x} - x^*) \geq 0, \tag{34}
\end{align}

which is obviously a contradiction.

14. Q.E.D.

Proof: From 10, 12, 13 and the definitions of $s$, $g$, and $h$ in the statement of Theorem 1.

A.2 Proof of Theorem 4

Theorem 4 Given a norm $s : \mathbb{R}^n \to \mathbb{R}_+$ of the form $s(x) = d_0 \|x\|_1 + \sqrt{d_1^2 \|x\|_1^2 + d_2^2 \|x\|_2^2}$
with $d_0, d_1, d_2 \geq 0$, the following property holds:

$$
\nabla s(y)^\top x \;\leq\; s(x) \;\leq\; \sqrt{x^\top \Lambda_y\, x} \qquad \forall x, y \in \mathbb{R}_{++}^n, \tag{16}
$$

where $\Lambda_y$ is a diagonal matrix whose $i$-th diagonal element is $s(y)\,\nabla_i s(y) / y_i$.


Proof

1. $\forall y \in \mathbb{R}_{++}^n$: $\nabla s(y)^\top y = s(y)$.

Proof: Since $s$ is a norm, $s(cy) = |c|\, s(y)$. By differentiating both sides w.r.t. $c$ and
evaluating it for $c = 1$, we obtain the statement 1.

2. $\forall y, x \in \mathbb{R}_{++}^n$: $\nabla s(y)^\top x \leq s(x)$.

Proof: The first-order conditions for convexity (Boyd and Vandenberghe, 2009, ch. 3.1.3)
imply $s(x) \geq s(y) + \nabla s(y)^\top (x - y)$. Substituting 1 and rearranging yields 2.

3. Define $r : \mathbb{R}_{++}^n \to \mathbb{R}_+$ as $r(y) = \sqrt{d_1^2 \|y\|_1^2 + d_2^2 \|y\|_2^2}$.

4. $\forall y, x \in \mathbb{R}_{++}^n$:

$$
\sqrt{d_1^2 \|x\|_1^2 + d_2^2 \|x\|_2^2} \;\leq\; \frac{1}{2} \left[ r(y)\, \frac{\|x\|_1}{\|y\|_1}
+ \frac{d_1^2 \|x\|_1^2 + d_2^2 \|x\|_2^2}{r(y)}\, \frac{\|y\|_1}{\|x\|_1} \right].
$$

Proof: From the inequality $\sqrt{z} \leq \frac{1}{2} \left[ \sqrt{z_0} + z / \sqrt{z_0} \right]$, which in turn results from the
concavity of the square root function, applied with $\sqrt{z_0} = r(y)\, \|x\|_1 / \|y\|_1$.

5. $\forall y, x \in \mathbb{R}_{++}^n$:

$$
s^2(x) \;\leq\; s(y) \left[ \frac{d_0}{\|y\|_1} \|x\|_1^2 + \frac{d_1^2}{r(y)} \|x\|_1^2 + \frac{d_2^2}{r(y)} \|x\|_2^2 \right].
$$

Proof: Follows from writing out the lhs explicitly using the definition of $s$ and then
exploiting the statement in 4.

6. $\forall y, x \in \mathbb{R}_{++}^n$:

$$
\frac{d_0}{\|y\|_1} \|x\|_1^2 + \frac{d_1^2}{r(y)} \|x\|_1^2 + \frac{d_2^2}{r(y)} \|x\|_2^2
\;\leq\; \sum_i \left[ \frac{d_0}{y_i} + \frac{d_1^2\, \|y\|_1}{r(y)\, y_i} + \frac{d_2^2}{r(y)} \right] x_i^2 .
$$

Proof: The last term of each side of the inequality is identical. Applying Radon's
inequality, it is easy to show that each one of the first two terms of the lhs is bounded
by the corresponding term in the rhs.

7. $\forall y, x \in \mathbb{R}_{++}^n$: $s^2(x) \leq x^\top \Lambda_y\, x$.

Proof: Follows from combining 5 and 6, then using the definition of $\Lambda_y$.

8. Q.E.D.

Proof: Combining 2 and 7 proves the theorem.